A CDK-based random forest model for Abraham descriptors.
Researchers: Andrew SID Lang
Introduction
The Abraham general solvation models can used to predict partition coefficients via the linear fre energy relationships:
log P = c + e*E + s*S + a*A + b*B + v*V
log K = c + e*E + s*S + a*A + b*B + l*L
where c, e, s, a, b, v, and l are the solvent coefficients and E, S, A, B, V, and L and the solute descriptors.
Here we use Open Data to build random forest models that can be used to predict E, S, A, B, V, and L.
Procedure
We took the latest available data of compounds with known Abraham descriptors [1] and combined it with compounds with Abraham descriptors calculated on-the-fly from ONSChallenge solubility data [2]. This resulted in a dataset (xlsx) of 3400 rows. CDK descriptors were then calculated using Rajarshi Guha's CDK Desc Calc Gui (v1.3.9). In order to calculate the descriptors, the SMILES of the following compounds had to be fixed due to incorrect aromaticity: 1,2,4-Triazole, 3-nitro-1,2,4-triazole, 2-methylbenzimidazole, 2-ethylanthraquinone, 2-methylimidazole, 4-methylimidazole, 3-methylindole, lansoprazole, benzotriazole, thiabendazole, pyrazole, amitrole, cimetidine, and valsartan. Once the SMILES were fixed, the CDK Descriptor Calculator still failed (not unsurprisingly) to produce descriptors for the following compounds: dihydrooxaphospha phenanthrene oxide, hexamethylhexahydrocyclohexasiloxane, pentamethylpentahydrocyclopentasiloxane, bis(trimethylsiloxy)methylsilane, tris(trimethylsiloxy)silane, phenylsilane, 1,1,3,3-tetramethyldisiloxane, hexamethyldisilazane, and tetraethyllead.
R was used to remove rows and columns with all NAs
The columns corresponding to the descriptors for HybRatio and Kier3 were then deleted due to a significant number of compounds having NA values (22 and 51 respectively). Several compounds (hydrogen selenide, germanium tetrachloride, tetramethyltin, tetraethyltin, arsine) were then removed due to NAs in the columns for the descriptors BCUTw.1l, BCUTw.1h, BCUTc.1l, BCUTc.1h, and BCUTp.1l. Similarly, the following compounds were removed because they had an NA in the topoShape descriptor: water, ammonia, methane, bromine, iodine, phosphine, hydrogen sulfide, and chlorine.
Highly correlated descriptors were then removed using the following code:
After curation, we were left with a dataset ready for modeling (xlsx: 3376 measurements, 76 descriptors). The data was then split into separate files for corresponding to each Abraham descriptor: E, S, A, B, V, and L (csv), see Table 1. In all files, cells with "-123" correspond to no measurement for that particular descriptor. For duplicate compounds (by CSID), only the most recent (by sortdate) values were kept. Table 1. Descriptive statistics.
Descriptor
Number (N)
Min Value
Max Value
Mean Value
SD
E
2217
-1.137
5.691
0.915
0.788
S
2287
-5.567
9.105
1.029
0.733
A
2287
0.000
9.368
0.182
0.382
B
2274
0.000
5.459
0.513
0.505
V
2287
0.2026
6.863
1.336
0.683
L
2011
-0.836
28.94
6.084
3.128
Random forest models were then creates for each descriptor using R. For example, the random forest model for E was created using the following code.
library("randomForest")
mydata <- read.csv(file="20141007E.csv",head=TRUE,row.names="molID")
mydata.rf <- randomForest(E ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = E ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 25
Mean of squared residuals: 0.02583414
% Var explained: 95.84
[output]
## get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Impor[[@https://onsmodels.shinyapps.io/OpenPhysProp/|Open PhysProp Shiny App]] tance")
saveRDS(mydata.rf, file = "ADE004")
The most important descriptors (TopoPSA, nHBDon, BCUTw.1h) can be seen on the following plot.
Results
Random forest models were created for all descriptors. In general, the models perform very well. The model for S could be better and the model for A is poor. This may be due to the fact that A is taken to be zero if the compound has zero hydrogen bond donors. The models can be downloaded by clicking on the link in Table 2 below. The models are included in the Open PhyProp Shiny App.
*Table 2. Summary of Results. Statistics are for OOB.*
AD004
A CDK-based random forest model for Abraham descriptors.Researchers: Andrew SID Lang
Introduction
The Abraham general solvation models can used to predict partition coefficients via the linear fre energy relationships:
log P = c + e*E + s*S + a*A + b*B + v*V
log K = c + e*E + s*S + a*A + b*B + l*L
where c, e, s, a, b, v, and l are the solvent coefficients and E, S, A, B, V, and L and the solute descriptors.
Here we use Open Data to build random forest models that can be used to predict E, S, A, B, V, and L.
Procedure
We took the latest available data of compounds with known Abraham descriptors [1] and combined it with compounds with Abraham descriptors calculated on-the-fly from ONSChallenge solubility data [2]. This resulted in a dataset (xlsx) of 3400 rows. CDK descriptors were then calculated using Rajarshi Guha's CDK Desc Calc Gui (v1.3.9). In order to calculate the descriptors, the SMILES of the following compounds had to be fixed due to incorrect aromaticity: 1,2,4-Triazole, 3-nitro-1,2,4-triazole, 2-methylbenzimidazole, 2-ethylanthraquinone, 2-methylimidazole, 4-methylimidazole, 3-methylindole, lansoprazole, benzotriazole, thiabendazole, pyrazole, amitrole, cimetidine, and valsartan. Once the SMILES were fixed, the CDK Descriptor Calculator still failed (not unsurprisingly) to produce descriptors for the following compounds: dihydrooxaphospha phenanthrene oxide, hexamethylhexahydrocyclohexasiloxane, pentamethylpentahydrocyclopentasiloxane, bis(trimethylsiloxy)methylsilane, tris(trimethylsiloxy)silane, phenylsilane, 1,1,3,3-tetramethyldisiloxane, hexamethyldisilazane, and tetraethyllead.
R was used to remove rows and columns with all NAs
library("caret") mydata <- read.csv(file="descriptors.csv",head=TRUE,row.names="molID") ncol(mydata) ## 274 columns mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ncol(mydata) ## 209 columns write.csv(mydata,file="descriptorsNoNA.csv")The columns corresponding to the descriptors for HybRatio and Kier3 were then deleted due to a significant number of compounds having NA values (22 and 51 respectively). Several compounds (hydrogen selenide, germanium tetrachloride, tetramethyltin, tetraethyltin, arsine) were then removed due to NAs in the columns for the descriptors BCUTw.1l, BCUTw.1h, BCUTc.1l, BCUTc.1h, and BCUTp.1l. Similarly, the following compounds were removed because they had an NA in the topoShape descriptor: water, ammonia, methane, bromine, iodine, phosphine, hydrogen sulfide, and chlorine.
Highly correlated descriptors were then removed using the following code:
library("caret") mydata <- read.csv(file="DescriptorsreadyForFeatureSelection.csv",head=TRUE,row.names="molID") ncol(mydata) ## 274 columns ## remove all columns with near zero variance mydata.nzv <- nearZeroVar(mydata) mydata <- mydata[, -mydata.nzv] ncol(mydata) ## 119 cols descrCorr <- cor(mydata); highCorr <- findCorrelation(descrCorr, 0.90, verbose=T); ## remove highly correlated descriptors mydata <- mydata[, -highCorr]; ncol(mydata) ## 76 columns write.csv(mydata,file="DescriptorsFeatureSelected.csv")After curation, we were left with a dataset ready for modeling (xlsx: 3376 measurements, 76 descriptors). The data was then split into separate files for corresponding to each Abraham descriptor: E, S, A, B, V, and L (csv), see Table 1. In all files, cells with "-123" correspond to no measurement for that particular descriptor. For duplicate compounds (by CSID), only the most recent (by sortdate) values were kept.
Table 1. Descriptive statistics.
library("randomForest") mydata <- read.csv(file="20141007E.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(E ~ ., data = mydata,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = E ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 25 Mean of squared residuals: 0.02583414 % Var explained: 95.84 [output] ## get variable importance plot varImpPlot(mydata.rf,main="Random Forest Variable Impor[[@https://onsmodels.shinyapps.io/OpenPhysProp/|Open PhysProp Shiny App]] tance") saveRDS(mydata.rf, file = "ADE004")The most important descriptors (TopoPSA, nHBDon, BCUTw.1h) can be seen on the following plot.
Results
Random forest models were created for all descriptors. In general, the models perform very well. The model for S could be better and the model for A is poor. This may be due to the fact that A is taken to be zero if the compound has zero hydrogen bond donors. The models can be downloaded by clicking on the link in Table 2 below. The models are included in the Open PhyProp Shiny App.*Table 2. Summary of Results. Statistics are for OOB.*
References
1. Bradley, Jean-Claude; Acree, William; Lang, Andrew (2014): Compounds with known Abraham descriptors. figshare: http://dx.doi.org/10.6084/m9.figshare.1176994 . Retrieved 15:48, Oct 17, 2014 (GMT)2. Bradley, Jean-Claude. Abraham Descriptors from ONSChallenge solubility data. Retrieved 15:48, Oct 17, 2014 (GMT)